
Create Your QSub Script

This section explains how to run Parallel Hello World from this page using OpenMPI 1.2.6 with the PGI compiler. For instructions on how to run the program using other MPI implementations, see these pages: MVAPICH2, MVAPICH. Note that although OpenMPI is easy to use, the MVAPICH 1 & 2 implementations are faster.

Make sure you've compiled your MPI program with the same compiler+MPI combination that you use to run the program. Also, make sure you've used switcher to select that same compiler+MPI combination before submitting the job. If you've been following the directions from the first half of this tutorial, then you've already done that.
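For reference, a typical switcher sequence looks something like the following. The tag name openmpi-1.2.6-pgi is only a guess at what this combination is called; use whatever name switcher actually lists on our cluster:

# show the compiler+MPI combinations available on this cluster
switcher mpi --list
# select one (the tag name here is a placeholder; use one from the list above)
switcher mpi = openmpi-1.2.6-pgi
# confirm the current selection
switcher mpi --show

Note that switcher changes usually take effect only in new login shells, so log out and back in (or open a new shell) before compiling and submitting.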

You cannot execute jobs directly on the cluster nodes yourself; you must have the cluster's batch system execute the programs for you. The cluster's batch system requires that you write a script and submit that script using the qsub command. The script should contain the commands that you need to have executed on the cluster nodes. There are three different queues to which you can submit your jobs. One queue is the testing queue, which is intended for short-lived test jobs for debugging. Please submit your jobs to the testing queue until you are sure that your code is working. Once it is working, you can use the low_priority or high_priority queues to run your job on more machines for longer periods of time. See this page if you want explanations of the differences between the three queues.
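If you want to see those queues from the command line, qstat can list them along with their limits; the names it prints should match the three queues described above:

# list the available queues and their limits
qstat -q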

Run a Test Job

Jobs in the testing queue are limited to four processes. Request the processor cores with nodes=1:ppn=P in your script and start that many processes with --np P on the mpirun line, where P is 1, 2, 3, or 4.

The script below will run a single-machine test job in our cluster's testing queue. That queue is intended for debugging, so we only allow short-lived, single-machine jobs there. The script requests only two processors, but you can request up to four using the ppn= option: ppn=1 gives you one processor, ppn=2 gives you two, and so on up to ppn=4 for four. In the next section, we will discuss how to run larger, longer-lived, non-test jobs in the low_priority or high_priority queues.

#!/bin/bash
: The above line tells Linux to use the shell /bin/bash to execute
: this script.  That must be the first line in the script.

: You must have no lines beginning with # before these
: PBS lines other than the /bin/bash line
#PBS -N 'hello_parallel'
#PBS -o 'qsub.out'
#PBS -e 'qsub.err'
#PBS -W umask=007
#PBS -q testing
#PBS -l nodes=1:ppn=2
#PBS -m bea

: Change our current working directory to the directory from which you ran qsub:
cd $PBS_O_WORKDIR

: Run the mpi program
mpirun --machinefile $PBS_NODEFILE --np 2 ./hello_parallel

I will explain the meaning of all of that below, after showing you how to run real, non-test jobs.

Now that you've created your qsub script for your single-machine test version of hello_parallel, it's time to run it with this command:

qsub hello_parallel.qsub

which should print out something like:

Job number is 3172.hpc.cl.rs.umbc.edu

Eventually the job will finish and produce the files qsub.err and qsub.out. The qsub.out file should contain the following:

hello_parallel.c: Number of tasks=2 My rank=0 My name="devnode001.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=2 My rank=1 My name="devnode001.cl.rs.umbc.edu".

If you've been following the Fortran 90, C++ or Fortran 77 tutorials, the .c will be .f90, .cc or .f, respectively, and there might be extra whitespace (spaces).

Submit a Multi-Machine Non-Test Job

Jobs in the testing queue can only use one machine and can only run for short periods of time, so your mighty hello_parallel program will need to run in the normal low_priority queue in order to reach its full potential. You will need to change the -q testing line to read -q low_priority, request more nodes with the -l option, and raise the process count on the mpirun line:

#!/bin/bash
: The above line tells Linux to use the shell /bin/bash to execute
: this script.  That must be the first line in the script.

: You must have no lines beginning with # before these
: PBS lines other than the /bin/bash line
#PBS -N 'hello_parallel'
#PBS -o 'qsub.out'
#PBS -e 'qsub.err'
#PBS -W umask=007
#PBS -q low_priority
#PBS -l nodes=5:ppn=4
#PBS -m bea

: Change our current working directory to the directory from which you ran qsub:
cd $PBS_O_WORKDIR

: Run the mpi program
mpirun --machinefile $PBS_NODEFILE --np 20 ./hello_parallel

If you have access to the high_priority queue, you can submit your job to that queue instead by replacing #PBS -q low_priority in that script with #PBS -q high_priority. Now that you've created your qsub script for a multi-machine, non-test version of hello_parallel, it's time to submit it to the queue with this command:

qsub hello_parallel.qsub

which should print out something like:

Job number is 3172.hpc.cl.rs.umbc.edu

Eventually the job will finish and produce the files qsub.err and qsub.out. You might want to check out this page to see how to monitor your job while it is in the queue or running; a short qstat example also appears after the sample output below. Once the job completes, you should see something like the following in qsub.out:

hello_parallel.c: Number of tasks=20 My rank=0 My name="node032.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=20 My rank=1 My name="node032.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=20 My rank=8 My name="node030.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=20 My rank=12 My name="node029.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=20 My rank=16 My name="node028.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=20 My rank=2 My name="node032.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=20 My rank=9 My name="node030.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=20 My rank=17 My name="node028.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=20 My rank=3 My name="node032.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=20 My rank=10 My name="node030.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=20 My rank=13 My name="node029.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=20 My rank=18 My name="node028.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=20 My rank=4 My name="node031.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=20 My rank=11 My name="node030.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=20 My rank=14 My name="node029.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=20 My rank=19 My name="node028.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=20 My rank=5 My name="node031.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=20 My rank=15 My name="node029.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=20 My rank=6 My name="node031.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=20 My rank=7 My name="node031.cl.rs.umbc.edu".

If you've been following the Fortran 90, C++ or Fortran 77 tutorials, the .c will be .f90, .cc or .f, respectively, and there might be extra whitespace (spaces).
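As mentioned above, you can keep an eye on a job while it is queued or running. A minimal sketch using the standard qstat command (3172 stands in for whatever job number qsub printed when you submitted):

# list all of your jobs and their states - Q means queued, R means running
qstat -u $USER
# show the full details of one particular job
qstat -f 3172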

What does everything in that script mean?

The mpirun line uses OpenMPI's mpirun script to execute your MPI job, while the rest of the script is used by qsub to decide where on the cluster your job runs. More information about qsub can be found here: Using QSub. Excerpts from that page are below.

The --machinefile $PBS_NODEFILE option tells OpenMPI where to find the list of machines allocated to your job. The --np 20 option tells OpenMPI how many processes to spawn. It is important to make sure that number equals the number of nodes multiplied by the number of processor cores per node (four). If you ask for more processes than you have processor cores, there will be a gigantic slowdown due to how OpenMPI is designed. You can ask for fewer processes than you have been allocated, but you have to be careful how you do that – see the bottom of this page for details.
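If you would rather not hard-code the process count, one common trick is to count the lines in $PBS_NODEFILE, since the node file normally contains one line per allocated processor core (so the count equals nodes times ppn). A minimal sketch that could replace the mpirun line in the script above:

: spawn exactly one MPI process per allocated processor core
NPROCS=$(wc -l < $PBS_NODEFILE)
mpirun --machinefile $PBS_NODEFILE --np $NPROCS ./hello_parallel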

Option Effect
-N 'hello_parallel' This sets the name of the job; the name that shows up in the "Name" column in qstat's output. The name has no significance to the scheduler. It exists simply for your convenience.
-o qsub.out -e qsub.err This tells qsub where it should send your job's output stream and error stream, respectively. If you do not specify these, your output stream will be sent to JOB_NAME.oJOB_NUMBER, where JOB_NAME is the name specified by -N and JOB_NUMBER is the job number assigned by PBS (the number printed out when you run qsub). Similarly, the error stream is sent to JOB_NAME.eJOB_NUMBER if you do not specify another location. If you do not want one of those streams, set the file name to /dev/null.
-l nodes=N:ppn=P This option does not mean what you might think it means; it does not request N P-processor nodes. Instead, it requests N groups of P processor cores, where each group of P processor cores is on the same machine. Thus nodes=5:ppn=2 might give you four processors on each of two machines and two processors on a third machine, or four on one machine and two on each of three machines. This option is a complex one, and is explained in more detail below.
-m bea If this were implemented on HPC, PBS would email you when your job reaches certain states. The b means "email me when my job starts running." The e means "email me when my job exits normally." The a means "email me if my job aborts." Unfortunately, this functionality does not currently work on HPC.
-l walltime=HH:MM:SS This option sets the maximum amount of time PBS will allow your job to run before it is automatically killed. The HH:MM:SS should be replaced with the maximum amount of time that you will let your job run, in hours, minutes and seconds. Generally, the smaller the timespan you ask for, the sooner your job will run. When this page was written, PBS gave your job four hours by default if you did not specify a walltime. See the example after this table.
-q queue_name Set the queue in which your job will run. Currently, the only queues are testing, low_priority and high_priority.
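None of the example scripts on this page set a walltime, so they receive the default mentioned above. If you know roughly how long your job needs, you can add a line like the following next to the other #PBS lines; this example asks for a ten-minute limit, and the exact default and maximum limits are cluster policy, so check the queue documentation before relying on them:

: kill the job automatically after ten minutes of wall-clock time
#PBS -l walltime=00:10:00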

The nodes=5:ppn=4 option is misleading since it does not request five machines with four processor cores each. The nodes=5:ppn=4 line requests five groups of four processor cores, where all four cores in each group are on the same machine. That is a subtle but important difference. If you had typed simply nodes=5 (which is equivalent to nodes=5:ppn=1), you would not be given one processor core on each of five different machines; you would get five processor cores somewhere on the cluster. Similarly, nodes=5:ppn=2 would give you five pairs of processor cores, somewhere on the cluster (but both processor cores in each pair will be on the same machine as one another).

PBS is free to allocate those sets of processors wherever it wants, and so nodes=5:ppn=2 might give you four processor cores on each of two machines and two on a third machine, or it might give you two processor cores on each of five machines, or perhaps four on one machine and two on each of three other machines. In all of those cases, PBS has done exactly what you told it to: it gave you five pairs of processor cores, where both cores in each pair are on the same machine.

Since the machines on the cluster have four processor cores each, you can ensure that you get five machines all to yourself by specifying nodes=5:ppn=4. That requests five groups of four processor cores, where all four processor cores in each group are on the same machine as one another. Since all of our machines have exactly four processor cores, this will give you five separate machines and ensure that nobody else's jobs are running on those machines.
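If you ever want to confirm how PBS actually laid out your request, you can inspect $PBS_NODEFILE from inside your job script; it normally lists one hostname per allocated processor core, so counting the repeated names shows how many cores you received on each machine. A small sketch you could add after the cd $PBS_O_WORKDIR line:

: show each machine we were given and how many cores we have on it
sort $PBS_NODEFILE | uniq -c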

Carefully consider what to choose for your ppn and nodes options. If you only need one processor and only a gigabyte or two of memory (such as for small serial jobs), you should be polite to other users and use nodes=1:ppn=1. If you are running a serial job that needs a lot of memory, then you should use nodes=1:ppn=4 to ensure that no other user uses up all of the machine's memory. Parallel programs that use multiple machines should use ppn=4 so that another user's job does not compete for the same Infiniband card. Our job will only run for a few seconds, so it's okay to use all four processors on each of five nodes.
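To summarize that advice, these are the three request lines you will usually choose between; pick exactly one for your script (the rest of the script stays the same):

: one core for a small serial job - be polite to other users
#PBS -l nodes=1:ppn=1

: a whole machine for a serial job that needs a lot of memory
#PBS -l nodes=1:ppn=4

: five whole machines for a multi-machine MPI job
#PBS -l nodes=5:ppn=4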

OpenMPI: Issues and Advanced Usage

Poor Performance

Currently, OpenMPI is the slowest MPI implementation, but it is the easiest to use. If fast message passing or high-bandwidth communication is critical to your application, you should check out MVAPICH2 or MVAPICH. If message passing takes up less than 10% of your program's runtime, then it is probably not worth switching MPI implementations since OpenMPI is generally easier and has more features.

Running Less than Four Processes Per Machine

Usually you should use four processes per machine, as we have done in this tutorial. Eventually, you might want to run fewer than four processes per machine (for example, if you decide to use OpenMP). It is not as simple as lowering the number given to the --np option. Look here for instructions.

Environment Variables

OpenMPI does not forward environment variables to your programs (i.e. hello_parallel) when it runs them on the compute nodes. Environment variables are things like your PATH and LD_LIBRARY_PATH. Your shell uses the PATH variable to find the programs that you try to execute. Your programs require dynamic libraries – chunks of code that are shared between multiple programs (such as the implementation of C's printf or Fortran's write) – and they use the LD_LIBRARY_PATH variable to find the dynamic libraries that they need. For simple, self-contained programs like hello_parallel, you do not need to forward environment variables; the default values will suffice. If your programs complain that they cannot find libraries or other programs, then you will need to forward environment variables. See this page for details.
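As a quick sketch of what that involves, OpenMPI's mpirun accepts a -x option that exports an environment variable from your submission shell to the processes it starts on the compute nodes. Assuming LD_LIBRARY_PATH is the variable your program is missing, the mpirun line in the non-test script could become:

: export LD_LIBRARY_PATH from the submission shell to the remote processes
mpirun --machinefile $PBS_NODEFILE --np 20 -x LD_LIBRARY_PATH ./hello_parallel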
